Before we start, it is crucial to first understand linear and logistic regression in depth, because (as we will see in the later discussion) they are the basic building blocks of a general neural network model.
A linear model can be written in a matrix form
\[ Y = X\beta + \epsilon, \] where \(Y\) is the output or label vector of length \(N\) (the number of observations), \(X\) is the input feature matrix (referred to as the design matrix in statistics) of dimension \(N\) by \(P\) (the number of features), \(\beta\) is the vector of model weights/coefficients (a column vector of length \(P\)) which we’d like to solve for, and \(\epsilon\) is the model residual or error vector.
The classical way to solve for \(\beta\) is ordinary least squares (OLS). The idea of OLS is to find the weights that minimize the mean of the squared model errors:
\[ \begin{aligned} \mbox{mse (loss)} &= \frac{1}{N}\sum_i\epsilon^2_i \\ &= \frac{1}{N}\sum_i(y_i - \beta x_i)^2, \end{aligned} \]
where \(y_i\) and \(x_i\) are the \(i\)-th observation. The first-order condition (requiring that the first-order derivatives w.r.t. the weights be zero) gives us the OLS solution for the model weights analytically (in matrix notation):
\[ \begin{equation} \label{eq:ols} \hat{\beta} = (X'X)^{-1}X'Y. \end{equation} \]
import numpy as np
from numpy.linalg import inv

def ols(X, y):
    return inv(X.T.dot(X)).dot(X.T).dot(y)

Let’s consider a toy model with only one non-constant feature:
\[ y_i = 6 + 4x_i + \epsilon_i. \]
In this model the outcome \(y\) is determined by a bias term plus a single variable \(x\), with an independently distributed noise term \(\epsilon \sim \mbox{Normal}(0, 1)\).
Create some random data generated from this model:
# Create toy example.
np.random.seed(777)
N = 1000
X = np.stack([np.ones(N), np.random.normal(size=N)], axis=1)
beta = np.array([6, 4], dtype=np.float32) # True model weights.
e = np.random.normal(size=N)
y = X.dot(beta) + e
print(X[:10])

[[ 1.         -0.46820879]
 [ 1.         -0.82282485]
 [ 1.         -0.0653801 ]
 [ 1.         -0.71336192]
 [ 1.          0.90635089]
 [ 1.          0.76623673]
 [ 1.          0.82605407]
 [ 1.         -1.32368279]
 [ 1.         -1.75244452]
 [ 1.          1.00244907]]
from plotly.offline import plot
import plotly.graph_objs as go

def plot_offline(data, layout, ofile):
    p = plot({"data": data, "layout": layout},
             filename=ofile, auto_open=False, include_plotlyjs=False)
    return p

ofile = "plots/toy_reg.html"
p = plot_offline(
    ofile=ofile,
    data=[go.Scatter(x=X[:,1], y=y, mode="markers")],
    layout=go.Layout(title="Data Generated from Toy Model",
                     xaxis=dict(title="x"),
                     yaxis=dict(title="y")))

Without considering the noise term, the OLS estimator solves precisely for the true model weights:
[6. 4.]
Of course, in the real world the noise term cannot be observed, and usually the features alone cannot entirely explain the variation in the outcome. This means that what we actually estimate is the expected value of the model weights:
[6.01132196 3.96247175]
By the Law of Large Numbers and the Central Limit Theorem, the OLS estimator converges in probability to the true model weights and is asymptotically Normally distributed in large samples.1
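We can sanity-check this claim with a quick Monte Carlo sketch (not part of the original exercise; the simulation setup and seed are arbitrary): repeatedly draw samples from the toy model, collect the OLS estimates, and watch their spread shrink as the sample size grows.

```python
import numpy as np
from numpy.linalg import inv

def ols(X, y):
    return inv(X.T.dot(X)).dot(X.T).dot(y)

rng = np.random.RandomState(42)

def simulate_betas(N, n_sim=500):
    """Repeatedly draw samples from the toy model and collect OLS estimates."""
    estimates = []
    for _ in range(n_sim):
        X = np.stack([np.ones(N), rng.normal(size=N)], axis=1)
        y = X.dot(np.array([6.0, 4.0])) + rng.normal(size=N)
        estimates.append(ols(X, y))
    return np.array(estimates)

small = simulate_betas(N=50)
large = simulate_betas(N=5000)
# The estimates center on the true weights [6, 4],
# and their spread shrinks as the sample size grows.
print(small.mean(axis=0), small.std(axis=0))
print(large.mean(axis=0), large.std(axis=0))
```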
The issue with the above approach is that equation \(\eqref{eq:ols}\) is not numerically stable in large-scale applications where we may have many observations and many features. One very useful way to solve for the estimator numerically at scale is the gradient descent approach.
Instead of solving the first-order condition analytically, we can do it numerically. Gradient descent is a 1st-order optimization technique to find local optimum of a given function.
In the model training exercise our target function is the loss so the optimization problem is:
\[ \operatorname*{argmin}_\beta \mbox{Loss} = \frac{1}{N}\sum_i(y_i - \beta x_i)^2. \]
That is, we’d like to figure out the model weights that minimize the loss, which is defined as the mean squared error when the model is a regression model.
The idea of gradient descent is to start from an initial guess of the weights and iteratively update them in the direction opposite to the gradient of the loss, scaled by a learning rate, until the loss converges.
Let’s use the toy example to implement a gradient descent optimizer from scratch. First we re-write the loss function explicitly for our setup of one coefficient plus a constant (\(\beta = [\beta_0, \beta_1]\)):
\[ \mbox{Loss} = \frac{1}{N}\sum_i\big[y_i-(\beta_0 + \beta_1x_i)\big]^2. \]
Now the gradient (or equivalently the 1st-order derivative) w.r.t. to weights will be:
\[ \begin{aligned} \frac{\partial\mbox{Loss}}{\partial\beta_0} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big], \\ \frac{\partial\mbox{Loss}}{\partial\beta_1} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big]x_i. \end{aligned} \]
The corresponding python function can be coded as:
def grad_func(X, y, beta):
    """Calculate vectorized gradients."""
    return -2 * ((y - X.dot(beta)).dot(X)) / X.shape[0]

If we set the above equations to zero we can solve for \(\beta_0\) and \(\beta_1\) analytically, and the solution will be exactly equation \(\eqref{eq:ols}\). But as we just pointed out, that suffers from a numerical stability issue.
The minimum implementation of our gradient descent optimizer is just a few lines:
def gd_optimize(X, y, lr=.01, n_step=100):
    b = np.random.normal(size=X.shape[1])
    for step in range(n_step):
        b -= lr*grad_func(X, y, b)
    return b

print(gd_optimize(X, y, n_step=3000))

[6.01132196 3.96247175]
As we can see, the result is very close to our analytical solution.
The learning rate is a hyper-parameter of the gradient descent optimizer. The gradient update to the model weights is scaled down by the learning rate to help ensure convergence. Too large a learning rate will cause the gradient to explode; too small a learning rate will slow down convergence and can leave the optimizer trapped at a local sub-optimum.
Let’s re-write our gradient descent optimizer to also track the loss from each training step. And we use the same initialization for a fair comparison.
def loss(X, y, beta):
    return ((X.dot(beta) - y)**2).mean()

def gd_optimize(X, y, lr=.01, n_step=100):
    b = np.array([0.0, 0.0])
    l = [loss(X, y, b)]
    for step in range(n_step):
        b -= lr*grad_func(X, y, b)
        l.append(loss(X, y, b))
    return b, l

Now we run the optimization with a variety of learning rates. For illustration purposes we will only run a few steps.
betas = {}
losses = {}
for lr in [.001, .01, .05, .1, 1]:
    betas[lr], losses[lr] = gd_optimize(X, y, lr=lr, n_step=15)

for r, b in betas.items():
    print("Learning Rate {:5} | Estimate: {}".format(r, b))

Learning Rate 0.001 | Estimate: [0.18155223 0.12379996]
Learning Rate 0.01 | Estimate: [1.60013258 1.08670659]
Learning Rate 0.05 | Estimate: [4.81547513 3.22246947]
Learning Rate 0.1 | Estimate: [5.81500752 3.85150789]
Learning Rate 1 | Estimate: [19.8158439 18.42703331]
The result suggests that a learning rate of 1 is too large for our problem: the gradient explodes and the solution diverges. A lower learning rate in general converges more slowly to the optimum. The number of examples used to calculate the gradient also affects the convergence behavior; in general, if the sample size is small, a smaller learning rate should be used to avoid gradient explosion.
This can be more clearly seen if we plot the trace of our training losses:
ofile = "plots/toy_reg_loss_compare.html"
pdata = [go.Scatter(x=np.arange(len(losses[lr])), y=losses[lr], name=lr)
         for lr in losses.keys()]
p = plot_offline(
    ofile=ofile,
    data=pdata,
    layout=go.Layout(
        title="Trace of Training Loss with Various Learning Rates",
        xaxis=dict(title="Step"),
        yaxis=dict(title="Loss")))

If the loss doesn’t decrease over training iterations, it is a signal that something is wrong with our model.
Unlike our toy implementation, modern implementations of numerical optimizers include many techniques to avoid convergence failure. But it is the model developer’s responsibility to diagnose the training behavior before anything is delivered to stakeholders. Checking the dynamics of the loss is usually the first and quickest way to examine whether the training task is functioning as expected.
The vanilla gradient descent optimizer we just implemented has one issue: since each update needs to traverse the entire dataset, it becomes too slow on a large dataset.
Batch gradient descent overcomes this issue. Instead of calculating the gradient using the entire dataset, we use only a random subset of it. Each update will be less precise, but statistically the result will be consistent.
Let’s implement a batch gradient descent optimizer:
def gd_batch_optimize(X, y, lr=.01, n_step=100, batch_size=64):
    b = np.random.normal(size=X.shape[1])
    l = [loss(X, y, b)]
    for step in range(n_step):
        sid = np.random.randint(X.shape[0], size=batch_size)
        Xb = X[sid,:]
        yb = y[sid]
        b -= lr*grad_func(Xb, yb, b)
        l.append(loss(Xb, yb, b))
    return b, l

batch_beta, batch_loss = gd_batch_optimize(X, y, n_step=3000, batch_size=64)
print(batch_beta)

[5.98189269 3.94839182]
Batch optimization is currently the best practice for training neural networks in large-scale applications. The batch size depends on the actual application, but usually ranges from 8 to 1024.
If we set the batch size to exactly 1, i.e., calculate each gradient update using only one data observation at a time, the approach is referred to as stochastic gradient descent.
Here we introduce the term epoch: one epoch is one full traverse of the entire dataset by the optimizer (no matter how many examples are used to calculate the gradient in each update). The number of epochs can be considered another hyper-parameter of a model. We could also implement epochs in our previous batch gradient descent optimizer, but for simplicity we skip that. Let’s implement it in our SGD optimizer:
def sgd_optimize(X, y, lr=.01, n_epoch=100):
    b = np.random.normal(size=X.shape[1])
    l = [loss(X, y, b)]
    for epoch in range(n_epoch):
        sid = np.random.permutation(X.shape[0])
        for i in sid:
            b -= lr*grad_func(X[None,i,:], y[i], b)
            l.append(loss(X[None,i,:], y[i], b))
    return b, l

sgd_beta, sgd_loss = sgd_optimize(X, y, lr=.01, n_epoch=10)
print(sgd_beta)

[5.98165419 3.88881224]
Plot the losses for the first 500 updates.
ofile = "plots/toy_reg_sgd_loss.html"
p = plot_offline(
    ofile=ofile,
    data=[go.Scatter(x=np.arange(len(sgd_loss[:500])), y=sgd_loss[:500])],
    layout=go.Layout(
        title="Trace of SGD Training Loss",
        xaxis=dict(title="Step"),
        yaxis=dict(title="Loss")))

Unlike vanilla gradient descent, stochastic gradient descent doesn’t ensure that the loss is always decreasing. But statistically it should converge to the same solution as the vanilla approach.
A logistic regression model describes the outcome probabilistically:
\[ \begin{equation} \label{eq:logit} P(Y = 1) = \frac{1}{1 + e^{-X\beta}}. \end{equation} \]
Here the sigmoid function \(s(t) = \frac{1}{1 + e^{-t}}\) is used to transform a real number into probability space \([0, 1]\).
We can interpret the model as a linear model in log-odds. Assuming \(Y\) is binary and takes a value of 0 or 1, the odds of \(Y = 1\) are defined as \(\frac{P(Y = 1)}{P(Y = 0)} = \frac{P(Y = 1)}{1 - P(Y = 1)}\). Re-arranging the logistic model equation:
\[ \begin{aligned} \ln \Bigg[ \frac{P(Y = 1)}{1 - P(Y = 1)} \Bigg] &= \ln \Bigg[ \frac{\frac{1}{1 + e^{-X\beta}}}{\frac{e^{-X\beta}}{1 + e^{-X\beta}}} \Bigg] \\ &= \ln(1 + e^{-X\beta}) - \ln\big[e^{-X\beta}(1 + e^{-X\beta})\big] \\ &= \ln(1 + e^{-X\beta}) - \ln e^{-X\beta} - \ln(1 + e^{-X\beta}) \\ &= - \ln e^{-X\beta} \\ &= X\beta. \end{aligned} \]
That is, the model weights are linear in the log-odds of our target outcome.
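A minimal numeric check of this linearity (the stable sigmoid formulation here mirrors the one used later in this post): applying the log-odds transform to the sigmoid output recovers the raw score exactly.

```python
import numpy as np

def sigmoid(t):
    """Numerically stable sigmoid."""
    return np.exp(-np.logaddexp(0, -t))

# For any raw score t (think t = X.beta), the log-odds of sigmoid(t) is t itself.
t = np.linspace(-5, 5, 11)
p = sigmoid(t)
log_odds = np.log(p / (1 - p))
print(np.allclose(log_odds, t))  # True
```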
When the probability is transformed into odds, the range is transformed from \([0,1]\) to \([0,\infty)\).
# We plot only up to .95 since the value of odds will grow unboundedly.
prob = np.linspace(.001, .95, num=100)
odds = prob / (1 - prob)
log_odds = np.log(odds)

ofile = "plots/prob_odds.html"
p = plot_offline(
    ofile=ofile,
    data=[go.Scatter(x=prob, y=odds)],
    layout=go.Layout(
        title="Event Probabilities to Odds",
        xaxis=dict(title="Probability"),
        yaxis=dict(title="Odds")))

If we further transform odds to log-odds, the range is transformed from \([0,\infty)\) to \((-\infty,\infty)\).
# Use a different output file so we don't overwrite the previous plot.
ofile = "plots/prob_log_odds.html"
p = plot_offline(
    ofile=ofile,
    data=[go.Scatter(x=odds, y=log_odds)],
    layout=go.Layout(
        title="Event Odds to Log-Odds",
        xaxis=dict(title="Odds"),
        yaxis=dict(title="Log-Odds")))

Putting it together, this is the effect of the sigmoid function:
# Note that instead of coding s = 1 / (1 + np.exp(-t)),
# which is numerically unstable (np.exp(-t) overflows for large negative t),
# we use a math trick to make it stable.
def sigmoid(t):
    """Numerically stable sigmoid."""
    return np.exp(-np.logaddexp(0, -t))

t = np.linspace(-10, 10, num=100)

ofile = "plots/sigmoid.html"
p = plot_offline(
    ofile=ofile,
    data=[go.Scatter(x=t, y=sigmoid(t))],
    layout=go.Layout(
        title="A Sigmoid Function",
        xaxis=dict(title="Raw Value"),
        yaxis=dict(title="Probability")))

How do we solve for the model weights \(\beta\) in equation \(\eqref{eq:logit}\)? In the linear regression model we solved for the weights by minimizing the mean squared error. In the logistic regression model we likewise need to define a measure of modeling error.
Put differently, we’d like to measure the distance between our predicted probability distribution and the real event label distribution (called the empirical distribution). In information theory, cross entropy is used to measure the distance between two probability distributions.
The entropy of a discrete probability distribution is defined as:
\[ H(p) = - \sum_i p_i \log_2p_i, \]
where \(p_i\) is the probability for event \(i\). It measures the uncertainty of a stochastic event.
Take a coin flip as an example. We plot the entropy at different values of coin bias (the probability of getting a head instead of a tail):
def entropy_binary(p):
    return -p*np.log2(p) - (1 - p)*np.log2(1 - p)

prob = np.linspace(.01, .99, num=100)

ofile = "plots/entropy.html"
p = plot_offline(
    ofile=ofile,
    data=[go.Scatter(x=prob, y=entropy_binary(prob))],
    layout=go.Layout(
        title="Entropy of a Binary Event (Coin Flip)",
        xaxis=dict(title="Probability of a Head"),
        yaxis=dict(title="Entropy")))

The entropy of this event is maximized when the probability of flipping a head is exactly 0.5. At that point the event has the highest level of uncertainty, in the sense that the outcome of a flip is the most difficult to predict.
Now move on to cross entropy. The cross entropy between two discrete probability distributions \(p\) and \(q\) over the same support is defined as:
\[ H(p, q) = - \sum_i p_i \log_2q_i. \]
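A small numeric illustration with two made-up distributions: by Gibbs' inequality, the cross entropy is always at least the entropy of \(p\), and the two coincide only when \(q = p\).

```python
import numpy as np

p = np.array([.5, .3, .2])  # "true" distribution
q = np.array([.4, .4, .2])  # an approximating distribution

entropy = -np.sum(p * np.log2(p))        # H(p): cross entropy of p with itself
cross_entropy = -np.sum(p * np.log2(q))  # H(p, q)
# A mismatched q always costs extra bits relative to H(p).
print(cross_entropy > entropy)  # True
```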
For our logistic regression model, distribution \(p\) is the empirical distribution (the label distribution) and distribution \(q\) is our model-predicted distribution. Denote by \(q_i = P(y_i = 1)\) the prediction for the \(i\)-th example. The cross-entropy loss of our model can then be written as:
\[ \mbox{Cross-Entropy Loss} = - \frac{1}{N}\sum_i^N \bigg[ y_i\log_2q_i + (1 - y_i)\log_2(1 - q_i)\bigg], \]
where \(y_i \in \{0,1\}\) is the \(i\)-th binary training label and \(P(y_i = 1)\) the model prediction for the \(i\)-th example out of \(N\) total training examples. This is the mean value of cross entropy for each training example.
Since \(q_i = P(y_i = 1)\) is expressed by our model equation \(\eqref{eq:logit}\), now we can apply gradient descent to the loss function to find out the optimum model weights that minimize the cross-entropy loss.
Before we implement our optimizer for a logistic regression model, we demonstrate that minimizing cross entropy is indeed equivalent to maximizing data likelihood.
The data likelihood is just the product of the predicted probabilities of the observed label for each example, provided that the examples are independent:
\[ \mbox{likelihood} = \prod_i^N q_i^{y_i}(1 - q_i)^{1 - y_i}. \]
The maximum likelihood estimator will try to maximize the log of the likelihood, which is:
\[ \begin{aligned} \mbox{log-lik} &= \log_2 \prod_i^N q_i^{y_i}(1 - q_i)^{1 - y_i} \\ &= \sum_i^N \log_2 q_i^{y_i}(1 - q_i)^{1 - y_i} \\ &= \sum_i^N \bigg[ y_i\log_2 q_i + (1 - y_i)\log_2(1 - q_i) \bigg]. \end{aligned} \]
Taking the average over examples as well, the negative log-likelihood is exactly the cross-entropy loss.
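We can verify this equivalence numerically with a few made-up labels and predictions:

```python
import numpy as np

y = np.array([1, 0, 1, 1])       # binary labels
q = np.array([.9, .2, .6, .95])  # model-predicted P(y = 1)

# Average negative log-likelihood of the observed labels...
neg_log_lik = -np.mean(np.log2(q**y * (1 - q)**(1 - y)))
# ...equals the cross-entropy loss defined earlier.
cross_entropy = -np.mean(y*np.log2(q) + (1 - y)*np.log2(1 - q))
print(np.isclose(neg_log_lik, cross_entropy))  # True
```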
Now let’s also implement a batch gradient descent optimizer for a logistic regression model. We will use the same toy example and create additional random binary labels for this exercise.
The gradient function must be derived with respect to the model weights. To simplify the math we replace the base-2 log with the natural log (with no impact on the optimization), and we take advantage of the fact that the derivative of the sigmoid function is:2
\[ \begin{aligned} \frac{ds(t)}{dt} &= \frac{d\frac{1}{1 + e^{-t}}}{dt} \\ &= -(1 + e^{-t})^{-2} \cdot (- e^{-t}) \\ &= \frac{e^{-t}}{(1 + e^{-t})^2} \\ &= \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}} \\ &= s(t) \cdot [1 - s(t)]. \end{aligned} \]
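A quick finite-difference sketch confirms this identity (using the numerically stable sigmoid from earlier):

```python
import numpy as np

def sigmoid(t):
    """Numerically stable sigmoid."""
    return np.exp(-np.logaddexp(0, -t))

# Compare the analytical derivative s(t)(1 - s(t))
# against a central finite-difference approximation.
t = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (sigmoid(t + eps) - sigmoid(t - eps)) / (2*eps)
analytic = sigmoid(t) * (1 - sigmoid(t))
print(np.allclose(numeric, analytic))  # True
```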
The gradient in our univariate model w.r.t. the bias term will be:3
\[ \begin{aligned} \frac{\partial\mbox{LogLoss}}{\partial\beta_0} &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{\partial\ln q_i}{\partial\beta_0} + (1 - y_i)\frac{\partial\ln(1 - q_i)}{\partial\beta_0}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}\frac{\partial q_i}{\partial\beta_0} + (1 - y_i)\frac{1}{1 - q_i}\frac{\partial(1 - q_i)}{\partial\beta_0}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}\frac{\partial (\beta_0 + \beta_1x_i)}{\partial\beta_0}\frac{\partial q_i(t)}{\partial t} - (1 - y_i)\frac{1}{1 - q_i}\frac{\partial (\beta_0 + \beta_1x_i)}{\partial\beta_0}\frac{\partial q_i(t)}{\partial t}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}q_i(1 - q_i) - (1 - y_i)\frac{1}{1 - q_i}q_i(1 - q_i)\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i(1 - q_i) - (1 - y_i)q_i \bigg] \\ &= - \frac{1}{N}\sum_i (y_i - q_i). \end{aligned} \]
Similarly for the weight:
\[ \frac{\partial\mbox{LogLoss}}{\partial\beta_1} = - \frac{1}{N}\sum_i (y_i - q_i)x_i. \]
Now the Python code for batch gradient descent with log-loss:
def logloss(X, y, b):
    logloss_pos = y * np.log(sigmoid(X.dot(b)))
    logloss_neg = (1 - y) * np.log(1 - sigmoid(X.dot(b)))
    return - (logloss_pos + logloss_neg).mean()

def logit_grad_func(X, y, b):
    return - (y - sigmoid(X.dot(b))).dot(X) / X.shape[0]

def cross_entropy_gd_optimize(X, y, lr=.01, n_step=100, batch_size=64):
    b = np.random.normal(size=X.shape[1])
    l = [logloss(X, y, b)]
    for step in range(n_step):
        sid = np.random.randint(X.shape[0], size=batch_size)
        Xb = X[sid,:]
        yb = y[sid]
        b -= lr*logit_grad_func(Xb, yb, b)
        l.append(logloss(Xb, yb, b))
    return b, l

log_beta, log_loss = cross_entropy_gd_optimize(X, y2, lr=.01, n_step=3000)
print(log_beta)

[2.7018906 1.37265212]
ofile = "plots/toy_log_loss.html"
pdata = [go.Scatter(x=np.arange(len(log_loss)), y=log_loss)]
p = plot_offline(
    ofile=ofile,
    data=pdata,
    layout=go.Layout(
        title="Trace of Training Log-Loss",
        xaxis=dict(title="Step"),
        yaxis=dict(title="Loss")))

Compared to linear regression, a logistic regression model converges more slowly. Since our naive implementation does no convergence diagnostics, let’s use R’s built-in glm function, which uses Newton’s method (a 2nd-order optimizer utilizing not only 1st-order but also 2nd-order derivatives), to check the estimation result for our toy example:
(Intercept) x
7.279770 4.913451
Now increase both the learning rate and the number of training steps of our naive gradient descent optimizer:
[7.19221871 4.82282142]
Seems better.
Or we can use the Python package statsmodels (which also uses a 2nd-order optimizer by default) to check our result:
Optimization terminated successfully.
Current function value: 0.097274
Iterations 10
<class 'statsmodels.iolib.summary.Summary'>
"""
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: Logit Df Residuals: 998
Method: MLE Df Model: 1
Date: Tue, 25 Jun 2019 Pseudo R-squ.: 0.6480
Time: 21:59:24 Log-Likelihood: -97.274
converged: True LL-Null: -276.32
Covariance Type: nonrobust LLR p-value: 7.328e-80
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 7.2798 0.721 10.100 0.000 5.867 8.692
x1 4.9135 0.534 9.195 0.000 3.866 5.961
==============================================================================
Possibly complete quasi-separation: A fraction 0.36 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
"""
Or we can use the popular machine learning library sklearn to do the same check:
from sklearn.linear_model import LogisticRegression

# We need to set C to an arbitrarily large number to suppress regularization
# since our naive approach doesn't implement any regularization.
LogisticRegression(C=1e16, fit_intercept=False).fit(X, y2).coef_

array([[7.27975064, 4.91343756]])
In the previous sections we implemented simple gradient descent optimizers by manually deriving the functional form of the gradients on our own. This becomes troublesome as our model grows more and more complicated, as in the case of a deep neural net.
Automatic differentiation is a programming technique to calculate the gradient of any given function. One of the most popular libraries for this purpose is TensorFlow.
Let’s use tensorflow to implement our simple gradient descent optimizer again. But this time we will NOT explicitly derive the gradient function. Instead, we will only specify the target function, which is just the loss function of our model.
import tensorflow as tf

tf.enable_eager_execution()

# Use tensors to represent our data.
# Note that we need to be very specific about dtype/shape of our tensors.
X_tf = tf.convert_to_tensor(X, dtype=tf.float32)
beta_tf = tf.reshape(tf.convert_to_tensor(beta, dtype=tf.float32), (2, 1))
e_tf = tf.reshape(tf.convert_to_tensor(e, dtype=tf.float32), (N, 1))
y_tf = tf.matmul(X_tf, beta_tf) + e_tf
y2_tf = tf.reshape(tf.convert_to_tensor(y2, dtype=tf.float32), (N, 1))

def loss_mse_tf(X, y, beta):
    y_hat = tf.matmul(X, beta)
    return tf.reduce_mean(input_tensor=(y_hat - y)**2)

def logloss_tf_bad(X, y, beta):
    # This can suffer from numerical stability issues.
    logloss_pos = y * tf.log(tf.sigmoid(tf.matmul(X, beta)))
    logloss_neg = (1 - y) * tf.log(1 - tf.sigmoid(tf.matmul(X, beta)))
    return - tf.reduce_mean(logloss_pos + logloss_neg)

def logloss_tf(X, y, beta):
    # tf.sigmoid is not numerically stable.
    # But there is a convenient function to do the cross entropy calculation stably.
    s = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=tf.matmul(X, beta))
    return tf.reduce_mean(s)

def gd_optimize_tf(X, y, loss_func, lr=.01, n_step=100):
    grad_func_tf = tf.contrib.eager.gradients_function(loss_func, params=[2])
    beta = tf.random_normal((2, 1))
    for step in range(n_step):
        grad = grad_func_tf(X, y, beta)[0]
        beta -= lr * grad
    return beta.numpy()

# For the MSE loss problem.
print(gd_optimize_tf(X_tf, y_tf, loss_func=loss_mse_tf, n_step=3000))
# For the cross entropy loss problem.
[[6.0113106]
[3.9624662]]
[[7.1964693]
[4.8522964]]
In the above coding example, one should realize that we no longer need to hardcode the functional form of the gradients. Instead we just plug in the loss function and let tensorflow do the gradient calculation for us.
Of course in actual development we will use higher-level APIs to implement our model, where the entire optimization process is abstracted away from the application code.
Both linear regression and logistic regression can be considered simple additive models of the form:
\[ \hat{y} = \Phi\bigg(\sum_{i=1}^Pw_ix_i\bigg), \]
where \(P\) is the number of features used, \(x_i\) is the \(i\)-th feature, and \(\Phi(\cdot)\) is a function applied to the output. In linear regression \(\Phi(\cdot)\) is simply an identity function. In logistic regression \(\Phi(\cdot)\) is the standard sigmoid function.
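To make the correspondence concrete, here is a tiny sketch (the feature and weight values are made up for illustration) expressing both models through the same additive form:

```python
import numpy as np

def sigmoid(t):
    """Numerically stable sigmoid."""
    return np.exp(-np.logaddexp(0, -t))

def additive_model(x, w, phi):
    """y_hat = phi(sum_i w_i * x_i)."""
    return phi(np.dot(w, x))

x = np.array([1.0, 0.5, -2.0])  # features (first entry is a constant term)
w = np.array([0.3, 1.2, -0.4])  # weights

linear_pred = additive_model(x, w, phi=lambda t: t)  # identity -> linear regression
logistic_pred = additive_model(x, w, phi=sigmoid)    # sigmoid -> logistic regression
print(linear_pred, logistic_pred)
```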
Now suppose there is a way to ensemble multiple such additive models to generate a potentially better and more sophisticated model. We use the following diagram for illustration.
In the above diagram, \(Y_{11}\) is a single additive model
\[ Y_{11} = \Phi(W_{111}X_1 + W_{211}X_2 + W_{311}X_3). \]
Similarly \(Y_{12}\) is another such model (with the same input feature set but different model weights)
\[ Y_{12} = \Phi(W_{112}X_1 + W_{212}X_2 + W_{312}X_3). \]
Model \(Y_2\) is yet another additive model but takes the output of the above two models:
\[ Y_2 = \Phi(W_{1121}Y_{11} + W_{1221}Y_{12}). \]
The above setup is a simple architecture of a neural network model, with only one hidden layer of two neurons. (We also ignore the constant/bias term in each layer for simplicity.) A neuron is simply an additive model with a so-called activation function \(\Phi(\cdot)\) to transform the output from any real number into a scaled signal.
One can now easily see that a logistic regression model can be viewed as a degenerate neural network with a single neuron and a sigmoid activation function, and a linear regression model as a degenerate neural network with a single neuron and no activation function.
Why do we need the activation function? In the above neural network model, in the absence of activation functions the final output model \(Y_2\) degenerates into a simple linear model. We can use simpler notation to demonstrate this:
\[ \begin{aligned} y_1 &= ax + b, \\ y_2 &= cx + d, \\ y_3 &= e + fy_1 + gy_2 \\ &= e + f(ax + b) + g(cx + d) \\ &= \underbrace{(e + fb + gd)}_\text{Bias} + \underbrace{(fa + gc)}_\text{Weight}x. \end{aligned} \]
Without activation functions, no matter how many neurons or layers we design into our model, it eventually reduces to a simple linear model. With an activation function applied to each neuron, the model becomes non-linear and hence can capture much more complicated patterns hidden in the data.
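We can verify this collapse numerically with randomly drawn coefficients (the names \(a\) through \(g\) follow the equations above):

```python
import numpy as np

rng = np.random.RandomState(0)
a, b, c, d, e, f, g = rng.normal(size=7)
x = np.linspace(-3, 3, 50)

# Two linear "neurons" without activation, fed into a third linear unit...
y1 = a*x + b
y2 = c*x + d
y3 = e + f*y1 + g*y2

# ...collapse into a single linear model with a composite bias and weight.
collapsed = (e + f*b + g*d) + (f*a + g*c)*x
print(np.allclose(y3, collapsed))  # True
```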
Some popular activation functions:
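Three of the most common choices can be sketched in numpy as follows (these are the standard textbook definitions, not code from the original exercise):

```python
import numpy as np

# Sigmoid squashes a raw score into (0, 1).
def sigmoid(t):
    return np.exp(-np.logaddexp(0, -t))

# Hyperbolic tangent squashes into (-1, 1) and is zero-centered.
def tanh(t):
    return np.tanh(t)

# ReLU keeps positive values and clips negatives to 0.
def relu(t):
    return np.maximum(0, t)

t = np.array([-2.0, 0.0, 2.0])
print(sigmoid(t))
print(tanh(t))
print(relu(t))
```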
To solve for model weights in a neural network, we use a technique called backpropagation which is essentially an iterative process of gradient descent.
To simplify notation we assume each neuron is simply a univariate model. Consider the following minimum architecture:
Mathematically:
\[ \begin{aligned} \hat{y} &= \Phi(b_1 + w_1z) \\ &= \Phi(b_1 + w_1\Phi(b_0 + w_0x)), \end{aligned} \]
where
\[ z = \Phi(b_0 + w_0x), \]
and
\[ \Phi(t) = \frac{1}{1 + e^{-t}}. \]
Though it may not be very meaningful to use the MSE loss when the output layer has an activation function applied, we can still do so for educational purposes.
\[ \mbox{MSE-Loss} = \frac{1}{N}\sum_i^N (y_i - \hat{y}_i)^2. \]
First we take the derivative w.r.t. the weight in the last layer (we drop the constant factor 2 produced by differentiating the square, since it only rescales the learning rate):
\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_1} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial w_1} \\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \underbrace{ \frac{\partial t}{\partial w_1}\frac{\partial \Phi(t)}{\partial t} }_{t = b_1 + w_1\Phi(b_0 + w_0x)}\\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot z_i. \end{aligned} \]
Similarly for the bias in the last layer:
\[ \frac{\partial \mbox{MSE-Loss}}{\partial b_1} = - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)). \]
Now move on to the bias and weight in the first layer:
\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial w_0} \\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \underbrace{ \frac{\partial t}{\partial w_0}\frac{\partial \Phi(t)}{\partial t} }_{t = b_1 + w_1\Phi(b_0 + w_0x)}\\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot \underbrace{ \Phi(k)(1 - \Phi(k)) }_{k = b_0 +w_0x} \cdot w_1 \cdot x_i, \\ \frac{\partial \mbox{MSE-Loss}}{\partial b_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot \Phi(k)(1 - \Phi(k)) \cdot w_1. \end{aligned} \]
One can clearly see the linkage between the derivatives of the weights in consecutive layers:
\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_1} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot z_i, \\ \frac{\partial \mbox{MSE-Loss}}{\partial w_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \underbrace{\Phi(t)(1 - \Phi(t))}_{\hat{y}_i(1 - \hat{y}_i)} \cdot \underbrace{\Phi(k)(1 - \Phi(k)) \cdot w_1 \cdot x_i}_{\frac{\partial w_1z}{\partial w_0} = \frac{\partial w_1\Phi(b_0 + w_0x)}{\partial w_0}}. \end{aligned} \]
Similarly we can derive the gradients for cross-entropy loss.
\[ \begin{aligned} \mbox{LogLoss} &= - \frac{1}{N}\sum_i^N \bigg[ y_i\ln\hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\bigg], \\ \frac{\partial \mbox{LogLoss}}{\partial w_1} &= - \frac{1}{N} \sum_i (y_i - \hat{y}_i)z_i, \\ \frac{\partial \mbox{LogLoss}}{\partial w_0} &= - \frac{1}{N} \sum_i (y_i - \hat{y}_i)\Phi(k)(1 - \Phi(k))w_1x_i.\\ \end{aligned} \]
(Skip the bias terms to save space.)
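Chain-rule derivations like these are easy to get wrong, so a finite-difference check is worthwhile. Below is a small sketch for the \(\partial/\partial w_1\) term of the MSE loss (note it keeps the constant factor 2 from differentiating the square, which the derivation absorbs into the learning rate); the data and parameter values are made up for the check:

```python
import numpy as np

def phi(t):
    """Numerically stable sigmoid used as the activation."""
    return np.exp(-np.logaddexp(0, -t))

def forward(x, b0, w0, b1, w1):
    z = phi(b0 + w0*x)
    return z, phi(b1 + w1*z)

def mse(y, y_hat):
    return np.mean((y - y_hat)**2)

rng = np.random.RandomState(1)
x = rng.normal(size=20)
y = rng.uniform(size=20)
b0, w0, b1, w1 = 0.1, -0.5, 0.3, 0.8

# Analytical gradient w.r.t. w1 (with the constant 2 kept).
z, y_hat = forward(x, b0, w0, b1, w1)
grad_w1 = -2*np.mean((y - y_hat) * y_hat*(1 - y_hat) * z)

# Central finite-difference approximation.
eps = 1e-6
loss_plus = mse(y, forward(x, b0, w0, b1, w1 + eps)[1])
loss_minus = mse(y, forward(x, b0, w0, b1, w1 - eps)[1])
numeric = (loss_plus - loss_minus) / (2*eps)
print(np.isclose(grad_w1, numeric))  # True
```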
Now let’s implement the simple neural network model in Python:
def sigmoid(x):
    return np.exp(-np.logaddexp(0, -x))

def sigmoid_grad(z):
    return z*(1 - z)

class NeuralNetwork:
    """Simple neural network with 1 hidden layer of 4 neurons."""

    def __init__(self, x, y):
        self.input = x
        self.w1 = np.random.rand(self.input.shape[1], 4)
        self.w2 = np.random.rand(4, 1)
        self.y = y
        self.output = np.zeros(y.shape)

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.w1))
        self.output = sigmoid(np.dot(self.layer1, self.w2))

    def backprop(self, lr):
        raise NotImplementedError

    def train(self, lr, n_step):
        for step in range(n_step):
            self.feedforward()
            self.backprop(lr=lr)
        yhat = ",".join([str(y) for y in np.squeeze(self.output)])
        return yhat

class NerualNetworkWithMSELoss(NeuralNetwork):

    def backprop(self, lr):
        # We ignore all constant terms in the gradient.
        self.dy = self.y - self.output
        dw2 = - np.dot(self.layer1.T, (self.dy*sigmoid_grad(self.output)))
        dw1 = - np.dot(self.input.T,
                       (np.dot(self.dy*sigmoid_grad(self.output),
                               self.w2.T)*sigmoid_grad(self.layer1)))
        self.w1 -= lr * dw1
        self.w2 -= lr * dw2

class NerualNetworkWithLogLoss(NeuralNetwork):

    def backprop(self, lr):
        # We ignore all constant terms in the gradient.
        self.dy = self.y - self.output
        dw2 = - np.dot(self.layer1.T, self.dy)
        dw1 = - np.dot(self.input.T,
                       np.dot(self.dy, self.w2.T)*sigmoid_grad(self.layer1))
        self.w1 -= lr * dw1
        self.w2 -= lr * dw2

Here we will use the hello-world example of artificial neural networks: the XOR problem. The XOR logical outcome, besides being extremely simple, is not linearly separable, so it serves as a good showcase of neural networks’ non-linearity.
The input data is simply the combinations of two binary switches. (For completeness we also include a constant term, which always evaluates to 1, as the first feature.) The output data is the XOR result.
X = np.array(
    [[1, 0, 0],
     [1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

print(X)  # Input features: Bias + two binary switches.

[[1 0 0]
[1 1 0]
[1 0 1]
[1 1 1]]
[[0]
[1]
[1]
[0]]
Now let’s see if our simple neural network model can learn the XOR pattern.
nn_mse = NerualNetworkWithMSELoss(X, y)
nn_mse.train(lr=1, n_step=3000)  # The predicted XOR outcome.

'0.016509058781027677,0.9728542987218917,0.9784540441634975,0.027059912943542017'
[[-2.02258664 1.42223805 -1.71522187 -4.26403023]
[ 6.22204576 5.73421659 -0.35405646 2.29735218]
[ 5.86924511 -3.55734133 4.32911695 3.3323955 ]]
[[10.56290787]
[-5.62409836]
[-4.67738934]
[-5.56713542]]
'0.0001870860964935529,0.9994879226437576,0.9993958543889606,0.0006300370523400314'
For this simple example, both the MSE and the cross-entropy loss manage to figure out the XOR pattern. (Strictly speaking, the cross-entropy loss performs better.)
One can try changing the sample size arbitrarily to see the behavior of the estimator.↩
Here we use the common derivatives: \(\frac{df(x)^n}{dx} = nf(x)^{n-1}f'(x)\) and \(\frac{de^{f(x)}}{dx} = f'(x)e^{f(x)}.\)↩
Here we use the common derivative \(\frac{d\ln f(x)}{dx} = \frac{f'(x)}{f(x)}\) and the chain rule to handle the term \(\frac{\partial\ln q_i}{\partial\beta_0}\) where \(q_i\) is the sigmoid computed for the \(i\)-th example.↩